refactor: give parquet CDC options an explicit enabled flag#22632
refactor: give parquet CDC options an explicit enabled flag#22632kszucs wants to merge 2 commits into
enabled flag#22632Conversation
3e66f6e to
30090c1
Compare
|
I had second thoughts about the cdc options configuration and actually found some weird implicit behavior (e.g. order dependent configuration), so I switched to a more verbose but explicit one (I also plan to follow this approach in iceberg). @alamb could we include it in the release so we don't need to break the API later on? |
|
Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch). Details |
30090c1 to
81e756c
Compare
| norm_level: cdc.norm_level, | ||
| } | ||
| ), | ||
| content_defined_chunking: Some(protobuf::CdcOptions { |
There was a problem hiding this comment.
Nice work wiring this through. One small maintainability idea: the CdcOptions to proto field mapping now appears here and in proto/src/logical_plan/file_formats.rs. It might be worth adding a small helper conversion, such as impl From<&CdcOptions> for protobuf::CdcOptions if that fits the crate boundaries, so future CDC field changes only need one mapping update per direction.
There was a problem hiding this comment.
@kosiew apparently those generate parquet proto types are different in logical_plan/file_formats.rs (coming from proto-models) and in proto-common. The protobuf serialization a bit tangled, let me know if I'm missing something.
5b2209e to
ca4a2bf
Compare
Content-defined chunking (CDC) write options were added in apache#21110 and have not been released yet (current workspace is 53.x; CDC is slated for 54.0.0), so the config and proto surfaces can still be changed freely. This reworks it before it ships. What changes: * Rename the `ParquetOptions` field `use_content_defined_chunking` -> `content_defined_chunking`. * `CdcOptions` becomes a plain `config_namespace!` with an explicit `enabled: bool` field alongside the chunking parameters, and the field is a bare `CdcOptions` (no longer `Option<CdcOptions>`). CDC is on iff `content_defined_chunking.enabled` is true. Add `CdcOptions::enabled()` / `CdcOptions::disabled()` shorthand constructors. * Drop the bespoke `impl ConfigField for CdcOptions` / `impl ConfigField for Option<CdcOptions>` and the `#[expect(clippy::should_implement_trait)]` workaround that backed the old bare-boolean form. Everything is now generated by the macro. * Add an `enabled` field to the proto `CdcOptions` message so the proto <-> config mapping is a direct field copy, dropping the previous presence-encoding and the zero-sentinel fallback for the chunk sizes. Why this is better: * Naming matches parquet-rs. parquet's `WriterProperties` exposes `content_defined_chunking()` / `set_content_defined_chunking(...)` with no `use_` prefix; the field name now lines up across the boundary. * Explicit, not magic. CDC is toggled with a real `content_defined_chunking.enabled = true|false` key instead of a special bare-boolean parse, and setting a chunking parameter no longer silently turns CDC on. * No order-dependence on the SQL side. Format options in `COPY ... OPTIONS` and `CREATE EXTERNAL TABLE ... OPTIONS` are applied from a `HashMap`, i.e. in non-deterministic order. With a separate `enabled` flag, the flag and the parameters are set independently, so the resolved config never depends on the order in which the keys happen to be applied. * Simpler. No hand-written `ConfigField` impls, no clippy hack, and the proto serialization is a plain field copy in both directions. Tests, generated config docs, and the information_schema snapshot are updated accordingly; a new `parquet_cdc_config.slt` documents the resolution behavior (enable toggle, parameter-does-not-enable, order independence).
ca4a2bf to
66eef56
Compare
Follow-up refinements to the parquet CDC options (all unreleased): * Rename the config struct `CdcOptions` -> `ParquetCdcOptions` for explicitness and consistency with the other parquet sub-option structs (`ParquetColumnOptions`, `ParquetEncryptionOptions`), and to disambiguate from the unrelated "change data capture" meaning of CDC. * Drop the chunk-size validation in the parquet writer path (parquet-rs enforces the bounds) and gate the writer call on the `enabled` flag. * Add `From` conversions for the config <-> parquet-rs and config <-> proto `CdcOptions` mappings, replacing the inline field copies. parquet-rs has no `enabled` flag, so the conversion encodes enabled <-> Option presence. * Rename the proto message to a top-level `ParquetCdcOptions` so config and proto names line up; field tags are unchanged, so the wire format is unaffected. Regenerated prost/pbjson for proto-common and proto-models.
66eef56 to
c7ccb83
Compare
Backport of the follow-up to apache#22632, adapted to branch-54 (which predates the proto-models split and the FromProto/TryFromProto refactor): * Rename the config struct `CdcOptions` -> `ParquetCdcOptions` for explicitness and consistency with the other parquet sub-option structs (`ParquetColumnOptions`, `ParquetEncryptionOptions`), and to disambiguate from the unrelated "change data capture" meaning of CDC. * Drop the chunk-size validation in the parquet writer path (parquet-rs enforces the bounds) and gate the writer call on the `enabled` flag, via `From` conversions between the config type and parquet-rs's `Option<CdcOptions>`. * Add `From` conversions for the config <-> proto `CdcOptions` mapping in proto-common, replacing the inline field copies. * Rename the proto message to `ParquetCdcOptions`; field tags are unchanged, so the wire format is unaffected. Regenerated prost/pbjson for proto-common and proto.
Which issue does this PR close?
Rationale for this change
The CDC options currently work as
use_content_defined_chunking: Option<CdcOptions>with aConfigFieldimpl that accepts a bareuse_content_defined_chunking = true|falseand otherwise enables CDC implicitly when any sub-field is set. This has a few problems:WriterPropertiesexposescontent_defined_chunking()/set_content_defined_chunking(Option<CdcOptions>)with nouse_prefix.COPY ... OPTIONS/CREATE EXTERNAL TABLE ... OPTIONSare applied from aHashMap(non-deterministic order). With the old bare-boolean form, mixing... = falsewith a sub-field, or setting a sub-field after= false, could resolve to enabled or disabled depending on iteration order.impl ConfigField for CdcOptions+impl ConfigField for Option<CdcOptions>and a#[expect(clippy::should_implement_trait)]workaround, plus a zero-sentinel fallback in the proto mapping.Since CDC is unreleased, the config/proto surface can still be changed freely.
What changes are included in this PR?
ParquetOptionsfielduse_content_defined_chunking->content_defined_chunking(matches parquet-rs).CdcOptionsa plainconfig_namespace!with an explicitenabled: boolfield alongside the chunking parameters; the field is a bareCdcOptions(no longerOption<CdcOptions>). CDC is on ifcontent_defined_chunking.enabledis true. Setting a parameter no longer implicitly enables CDC, and the result is independent of key order.CdcOptions::enabled()/CdcOptions::disabled()shorthand constructors.ConfigFieldimpls and theshould_implement_traitworkaround — all generated by the macro now.enabledfield to the protoCdcOptionsmessage so the proto <-> config mapping is a plain field copy in both directions (removes the presence-encoding and the zero-sentinel fallback).information_schemasnapshot, and addparquet_cdc_config.sltdocumenting the resolution behavior.Are these changes tested?
Yes:
datafusion-commonconfig + writer unit tests (enable toggle, parameter-does-not-enable, validation, writer round-trip).datafusion-proto-commonproto round-trip tests (enabled / disabled / negative norm level).datafusion/coreparquet integration tests (data round-trip, page boundaries).parquet_cdc.slt(end-to-end) and a newparquet_cdc_config.slt(config resolution / order independence).Are there any user-facing changes?
Yes, but only to the unreleased CDC options:
datafusion.execution.parquet.use_content_defined_chunking->datafusion.execution.parquet.content_defined_chunking.enabled(plus.min_chunk_size/.max_chunk_size/.norm_level).content_defined_chunking.enabled = true|false.No released API is affected.
🤖 Generated with Claude Code